Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Jiyin (Jenny) Zhang, UID: 606331859

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Load necessary libraries.

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
# Load additional libraries
library(dplyr)
library(duckdb)
# install.packages("readr")
library(readr)
# install.packages("styler")
library(styler)

Here display memory information of my computer.

memuse::Sys.meminfo()
Totalram:  8.000 GiB 
Freeram:   1.549 GiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 zhangjiyin  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 zhangjiyin  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer:

Define the path of the file admissions.csv.gz as follows:

file_path <- "~/mimic/hosp/admissions.csv.gz"

Read CSV using read.csv in base R.

base_r_time <- system.time(base_r_data <- read.csv(file_path, header = TRUE))
base_r_time
   user  system elapsed 
  3.413   0.051   3.464 

Read CSV using read_csv in tidyverse:

tidyverse_time <- system.time(
  tidyverse_data <- read_csv(file_path, show_col_types = FALSE)
)
tidyverse_time
   user  system elapsed 
  1.396   0.742   0.712 

Read CSV using fread in data.table:

data_table_time <- system.time(data_table_data <- fread(file_path))
data_table_time
   user  system elapsed 
  0.468   0.128   0.648 

Compare the runtimes of the three functions:

cat("Base R Time:", base_r_time[3], "seconds\n")
Base R Time: 3.464 seconds
cat("Tidyverse Time:", tidyverse_time[3], "seconds\n")
Tidyverse Time: 0.712 seconds
cat("Data.table Time:", data_table_time[3], "seconds\n")
Data.table Time: 0.648 seconds

Therefore, the fastest function is fread in data.table package.

Here compares the parsed data types of the three dataframes or tibbles and shows some differences in the parsed data types:

Here is the parsed data types of the base R dataframe:

str(base_r_data)
'data.frame':   431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : chr  "2180-05-06 22:23:00" "2180-06-26 18:27:00" "2180-08-05 23:44:00" "2180-07-23 12:35:00" ...
 $ dischtime           : chr  "2180-05-07 17:15:00" "2180-06-27 18:49:00" "2180-08-07 17:50:00" "2180-07-25 17:55:00" ...
 $ deathtime           : chr  "" "" "" "" ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : chr  "2180-05-06 19:17:00" "2180-06-26 15:54:00" "2180-08-05 20:58:00" "2180-07-23 05:54:00" ...
 $ edouttime           : chr  "2180-05-06 23:30:00" "2180-06-26 21:31:00" "2180-08-06 01:44:00" "2180-07-23 14:00:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...

Here is the parsed data types of the tidyverse tibble:

str(tidyverse_data)
spc_tbl_ [431,231 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id          : num [1:431231] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
 $ hadm_id             : num [1:431231] 22595853 22841357 25742920 29079034 25022803 ...
 $ admittime           : POSIXct[1:431231], format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct[1:431231], format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct[1:431231], format: NA NA ...
 $ admission_type      : chr [1:431231] "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr [1:431231] "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr [1:431231] "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr [1:431231] "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr [1:431231] "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr [1:431231] "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr [1:431231] "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr [1:431231] "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct[1:431231], format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct[1:431231], format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: num [1:431231] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   hadm_id = col_double(),
  ..   admittime = col_datetime(format = ""),
  ..   dischtime = col_datetime(format = ""),
  ..   deathtime = col_datetime(format = ""),
  ..   admission_type = col_character(),
  ..   admit_provider_id = col_character(),
  ..   admission_location = col_character(),
  ..   discharge_location = col_character(),
  ..   insurance = col_character(),
  ..   language = col_character(),
  ..   marital_status = col_character(),
  ..   race = col_character(),
  ..   edregtime = col_datetime(format = ""),
  ..   edouttime = col_datetime(format = ""),
  ..   hospital_expire_flag = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Here is the parsed data types of the data.table dataframe:

str(data_table_data)
Classes 'data.table' and 'data.frame':  431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : POSIXct, format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct, format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct, format: NA NA ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct, format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct, format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Here compares the memory usage:

cat("Base R Memory Usage:", pryr::object_size(base_r_data), "bytes\n")
Base R Memory Usage: 158710640 bytes
cat("Tidyverse Memory Usage:", pryr::object_size(tidyverse_data), "bytes\n")
Tidyverse Memory Usage: 55309384 bytes
cat("Data.table Memory Usage:", pryr::object_size(data_table_data), "bytes\n")
Data.table Memory Usage: 50129376 bytes

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

Answer:

Based on the parsed data types from Q1.1, explicitly specify column types:

col_types <- cols(
  admission_type = col_character(),
  admit_provider_id = col_character(),
  admission_location = col_character(),
  discharge_location = col_character(),
  insurance = col_factor(),
  language = col_character(),
  marital_status = col_character(),
  race = col_character(),
  subject_id = col_integer(),
  hadm_id = col_integer(),
  hospital_expire_flag = col_double(),
  admittime = col_datetime(format = ""),
  dischtime = col_datetime(format = ""),
  deathtime = col_datetime(format = ""),
  edregtime = col_datetime(format = ""),
  edouttime = col_datetime(format = "")
)

Read CSV with explicit column types:

tidyverse_data_explicit_types <- read_csv(file_path, col_types = col_types)

Measure runtime and memory usage:

# Measure runtime
tidyverse_time_explicit_types <- system.time(
  tidyverse_data_explicit_types <- read_csv(file_path, col_types = col_types)
)
# Display runtime and memory usage
cat(
  "Tidyverse Time (Explicit Types):",
  tidyverse_time_explicit_types[3],
  "seconds\n"
)
Tidyverse Time (Explicit Types): 0.66 seconds
cat(
  "Tidyverse Memory Usage (Explicit Types):",
  pryr::object_size(tidyverse_data_explicit_types),
  "bytes\n"
)
Tidyverse Memory Usage (Explicit Types): 50135824 bytes

Therefore, the run time does change compared with the previous 0.682 seconds. The memory usage is listed above.

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 zhangjiyin  staff  1939088924 Jan  5  2023 /Users/zhangjiyin/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file except the header row.

zcat < ~/mimic/hosp/labevents.csv.gz | head -11
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,
10,10000032,,52958335,50882,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,27,27,mEq/L,22,32,,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

Answer:

# Define file path
labevents_file_path <- "~/mimic/hosp/labevents.csv.gz"

Try reading labevents.csv.gz using read_csv:

labevents_data <- read_csv(labevents_file_path)

I waited for 5 minutes and the code did not finish running. Therefore, I aborted the program. The reason is that my memory size is 8GB and I expect this labevents.csv.gz file is too large to fit into my memory.

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

Answer:

Try to ingest only columns specified above:

labevents_data <- read_csv(
  labevents_file_path,
  col_select = c(subject_id, itemid, charttime, valuenum)
)

This didn’t solve the ingestion issue. The code did not finish running after 5 minutes. Therefore, I aborted the program. The reason is also that my memory size is 8GB and I expect this labevents.csv.gz file is still too large to fit into my memory.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer:

Print some lines of the original file labevents.csv.gz to understand its structure.

zcat < ~/mimic/hosp/labevents.csv.gz | head -3
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
zcat < ~/mimic/hosp/labevents.csv.gz | 
  awk -F, '{if($5 == 50912 || $5 == 50971 || $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || $5 == 51301 || $5 == 50931) \
  print $2, $5, $7, $10}' | 
  gzip > labevents_filtered.csv.gz
zcat < labevents_filtered.csv.gz | head -10
10000032 50882 2180-03-23 11:51:00 27
10000032 50902 2180-03-23 11:51:00 101
10000032 50912 2180-03-23 11:51:00 0.4
10000032 50971 2180-03-23 11:51:00 3.7
10000032 50983 2180-03-23 11:51:00 136
10000032 50931 2180-03-23 11:51:00 95
10000032 51221 2180-03-23 11:51:00 45.4
10000032 51301 2180-03-23 11:51:00 3
10000032 51221 2180-05-06 22:25:00 42.6
10000032 51301 2180-05-06 22:25:00 5

Here displays the number of lines in this new subsetted file:

zcat < labevents_filtered.csv.gz | wc -l
 24855909

Here measures the runtime:

# Define file path
labevents_filtered_file_path <- "./labevents_filtered.csv.gz"
# Measure runtime
labevents_filtered_data <- system.time(
  read_csv(labevents_filtered_file_path, show_col_types = FALSE)
)
labevents_filtered_data
   user  system elapsed 
 35.930  62.988 128.861 

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Answer:

First decompress labevents.csv.gz to labevents.csv and put it in the current working directory.

gzip -dk < ~/mimic/hosp/labevents.csv.gz > ./labevents.csv

Here displays how long the ingest+select+filter process takes:

labevents_arrow <- system.time({
  labevents_dataset <- arrow::open_dataset("labevents.csv", format = "csv")
  labevents_filtered_arrow <- labevents_dataset %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in%
      c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
labevents_arrow
   user  system elapsed 
  0.051   0.020   0.087 

Here display the number of rows and the first 10 rows of the result tibble, and they match those in Q2.3.

labevents_filtered_arrow_collected <- labevents_filtered_arrow %>%
  collect()
cat("Number of Rows:", nrow(labevents_filtered_arrow_collected), "\n")
Number of Rows: 24855909 
labevents_filtered_arrow_collected %>%
  arrange(subject_id) %>%
  head(10) %>%
  mutate(charttime = format(charttime, tz = "UTC", usetz = TRUE))
# A tibble: 10 × 4
   subject_id itemid charttime               valuenum
        <int>  <int> <chr>                      <dbl>
 1   10000032  50882 2180-03-23 11:51:00 UTC     27  
 2   10000032  50902 2180-03-23 11:51:00 UTC    101  
 3   10000032  50912 2180-03-23 11:51:00 UTC      0.4
 4   10000032  50971 2180-03-23 11:51:00 UTC      3.7
 5   10000032  50983 2180-03-23 11:51:00 UTC    136  
 6   10000032  50931 2180-03-23 11:51:00 UTC     95  
 7   10000032  51221 2180-03-23 11:51:00 UTC     45.4
 8   10000032  51301 2180-03-23 11:51:00 UTC      3  
 9   10000032  51221 2180-05-06 22:25:00 UTC     42.6
10   10000032  51301 2180-05-06 22:25:00 UTC      5  

what is Apache Arrow:

Instead of bundling up information in a big, complex package, Arrow breaks it down into smaller, organized pieces. These pieces are easy to understand and can be quickly sorted, searched, and shared by different computer systems.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Answer: First re-write the csv file in the binary Parquet format:

write_dataset(labevents_dataset, "labevents.parquet")

Here displays the size of the Parquet file(s):

file.size("labevents.parquet")
[1] 96

Read the Parquet file, select columns and filter rows and measure how long the ingest+select+filter process of the Parquet file(s) takes:

labevents_parquet <- system.time({
  labevents_parquet_dataset <- open_dataset("labevents.parquet")
  labevents_filtered_parquet <- labevents_parquet_dataset %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in%
      c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
labevents_parquet
   user  system elapsed 
  0.133   0.012   0.149 

Here display the number of rows and the first 10 rows and they match those in Q2.3.

labevents_filtered_parquet_collected <- labevents_filtered_parquet %>%
  collect()
cat(
  "Number of Rows (Parquet):", 
  nrow(labevents_filtered_parquet_collected), 
  "\n"
    )
Number of Rows (Parquet): 24855909 
labevents_filtered_parquet_collected %>%
  arrange(subject_id) %>%
  head(10) %>%
  mutate(charttime = format(charttime, tz = "UTC", usetz = TRUE))
# A tibble: 10 × 4
   subject_id itemid charttime               valuenum
        <int>  <int> <chr>                      <dbl>
 1   10000032  50882 2180-03-23 11:51:00 UTC     27  
 2   10000032  50902 2180-03-23 11:51:00 UTC    101  
 3   10000032  50912 2180-03-23 11:51:00 UTC      0.4
 4   10000032  50971 2180-03-23 11:51:00 UTC      3.7
 5   10000032  50983 2180-03-23 11:51:00 UTC    136  
 6   10000032  50931 2180-03-23 11:51:00 UTC     95  
 7   10000032  51221 2180-03-23 11:51:00 UTC     45.4
 8   10000032  51301 2180-03-23 11:51:00 UTC      3  
 9   10000032  51221 2180-05-06 22:25:00 UTC     42.6
10   10000032  51301 2180-05-06 22:25:00 UTC      5  

what is the Parquet format:

It’s a way to organize and store data so that it takes up less room on your computer, making everything more efficient. Think of it as a super-organized backpack where you can neatly pack all your information, and when you need something, it’s super quick to find, like magic! Parquet helps data be compact and easy to manage, which is great for handling lots of information without making your computer feel cluttered.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

Answer:

Read the Parquet file, select columns and filter rows. Then, measure how long the ingest+convert+select+filter process of the Parquet file(s) takes:

labevents_duckdb <- system.time({
  # Ingest the Parquet file
  labevents_parquet_dataset <- open_dataset("labevents.parquet")
  # Convert it to a DuckDB table:
  labevents_duckdb_table <- labevents_parquet_dataset %>%
    to_duckdb()
  labevents_duckdb_table_filtered <- labevents_duckdb_table %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in%
      c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931))
})
labevents_duckdb
   user  system elapsed 
  0.213   0.051   0.296 

Here display the number of rows and the first 10 rows. They match those in Q2.3 exactly.

labevents_duckdb_table_filtered_collected <- labevents_duckdb_table_filtered %>%
  collect()

cat("Number of Rows:", nrow(labevents_duckdb_table_filtered_collected), "\n")
Number of Rows: 24855909 
labevents_duckdb_table_filtered_collected %>%
  arrange(subject_id) %>%
  head(10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  

DuckDB is a special type of database that helps organize and quickly find information, making it perfect for handling large amounts of data. Imagine you have a giant bookshelf with data books, and DuckDB is like the librarian who can swiftly locate the exact book you need, making it a quick and efficient way to manage and retrieve information. It’s especially great for tasks where speed and agility in handling data are crucial.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -11
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220045,97,97,bpm,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer:

I’d like to use the Apache Arrow method in Q2.4 to retrieve a subset of chartevents.csv.gz

First, I will decompress chartevents.csv.gz and save the result to a new file chartevents.csv in the current working directory.

gzip -dk < ~/mimic/icu/chartevents.csv.gz > ./chartevents.csv

Then I will use arrow::open_dataset to ingest chartevents.csv and filter itemid that we are interested in: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210).

# Ingest the data
chartevents_dataset <- arrow::open_dataset("chartevents.csv", format = "csv")
# Filter the data
chartevents_filtered_arrow <- chartevents_dataset %>%
  filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) %>%
  collect()

Here display the number of rows and the first 10 rows.

cat("Number of Rows:", nrow(chartevents_filtered_arrow), "\n")
Number of Rows: 22502319 
head(chartevents_filtered_arrow, 10)
# A tibble: 10 × 11
   subject_id  hadm_id  stay_id caregiver_id charttime          
        <int>    <int>    <int>        <int> <dttm>             
 1   10000032 29079034 39553978        47007 2180-07-23 14:01:00
 2   10000032 29079034 39553978        47007 2180-07-23 14:01:00
 3   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 4   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 5   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 6   10000032 29079034 39553978        47007 2180-07-23 15:00:00
 7   10000032 29079034 39553978        66056 2180-07-23 12:00:00
 8   10000032 29079034 39553978        66056 2180-07-23 12:00:00
 9   10000032 29079034 39553978        66056 2180-07-23 12:00:00
10   10000032 29079034 39553978        66056 2180-07-23 12:00:00
# ℹ 6 more variables: storetime <dttm>, itemid <int>, value <chr>,
#   valuenum <dbl>, valueuom <chr>, warning <int>